Exploration of paired fastq files

In this exercise, we will download a pair of files in fastq.gz format and explore their contents. We will generate graphs to synthetically represent the sequencing quality as well as the composition of nitrogenous bases.

A. Libraries installation

Opening a command prompt

# Windows:
# Click on the start button and type `cmd` in the search bar. Press Enter.
# macOS:
# Press Command (⌘) + space to open Spotlight Search. Type "Terminal" and press Enter.

Biopython & Numpy & Matplotlib

# Windows and macOS: in your command prompt type:
python3 -m ensurepip --upgrade
python3 -m pip install --upgrade pip
python3 -m pip install Bio Numpy Matplotlib

# Linux: in your command prompt type:
sudo apt update
sudo apt install python3-pip
pip install Bio Numpy Matplotlib

B. Dowloading the files

The fastq files we will download are available on the ENA website. It is the European Nucleotide Archive, which is a comprehensive nucleotide sequence database, that includes raw sequencing data, sequence assembly information and functional annotation.

Go to the ENA database, and type the Saccharomyces cerevisiae in the search bar. You will get different blocks of results. In the Read block, click on the Run field and choose the first run accession that appears: SRR800768.

  1. What sequencing platform was used to do the sequencing run?

  2. What is the instrument model used?

  3. What is the library type used for the sequencing? Paired-end? Single-end or mate-pair?

  4. What is the type of molecules that have been sequenced?

A the bottom of the page, you can see two fastq file names.

  1. What is the extension of these files? What does it indicate?

  2. Are these files expected to be different in their contents? Justify your answer.

  3. Select both of the files and click on Get Download script.

A file will be downloaded. You can open it with a text editor (like notepad++ or Visual Studio Code). In this file you can find the following command lines that will allow you to download both files:

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR800/SRR800768/SRR800768_1.fastq.gz
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR800/SRR800768/SRR800768_2.fastq.gz

Open a command prompt and paste the command lines in your command prompt.
If 'wget' is not recognized, replace 'wget -nc' by 'curl -O' and paste the command lines in your command prompt.

To simplify the subsequent analyses, a subsampling from each of SRR800768_1.fastq.gz and SRR800768_2.fastq.gz was done using the seqtk library. Subsampling was done in a way to keep reads pairing (same number of reads in subsampled files, in the same order). The subsampled files: SRR800768_1_sub.fastq and SRR800768_2_sub.fastq can be downloaded on icampus. We will use these files for the rest of the exercice.

  1. What are the sizes of both files?

C. Analyzing the files

Importing the installed libraries in Python

Open a Python terminal. Do not forget to type all your commands in a text file with a .py extension.
Here is the libraries that need to be imported at the beginning of your python script:

from Bio import SeqIO
import gzip
import matplotlib.pyplot as plt
import numpy as np

Reading fastq files

Make sure that the Python session's working directory aligns with the location of this fatsq files. For that you can use the Python library os as follows:

import os
os.getcwd()

In case your fastq files are not located in the Python current working directory, you have two options: either copy them into the Python current directory, or modify the Python current directory to that of the fastq files using the following code:

new_directory = "/path/to/your/new/directory" # better if absolute path
os.chdir(new_directory)

Then, you will have to read both files using the following syntax.

# For gzipped files
records = SeqIO.parse(gzip.open('filename.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
# For non-gzipped files
records = SeqIO.parse('filename.fastq', 'fastq')

# You can iterate over my file as follow:
for record in records:
  seq = record.seq
  phred_scores = record.letter_annotations['phred_quality']

  1. How many reads do we have in each of the two files? Verify with a Python script that they are correctly paired.
  2. What is the read length?
  3. Plot the reads per base content for each of the fastq files: the x axis is the position on the read (0 to read length - 1) and on the y axis the percentages of each base (A in red, T in green, C in blue and G in black).
  4. What do you notice regarding the relative quantity of the different bases? Is it an expected result? Justify your answer.
  5. The GC-content (or guanine-cytosine content) is the percentage of nucleobases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). It is computed as follows:

%GC=G+CA+T+G+C×100\%GC = \frac{G + C}{A + T + G + C} \times 100

Estimate the GC content of the Saccharomyces cerevisiae sequenced genome and compare it with the known GC content of the species. 6. For each file, generate a plot illustrating the median, first quartile (Q1) and third quartile (Q3) of the Phred scores for each position in the read. The x-axis represents the position on the read, and the y-axis depicts the median (in purple), Q1 (in blue), and Q3 (in red) of the Phred scores.

Tip: you can use numpy to compute the Q1, Q2 and Q3 of a list:

my_list = [1, 2, 3, 3, 3, 5, 8]
# Q2 = median
my_median = np.median(my_list)
# Q1
Q1 = np.percentile(my_list, 25)
# Q3
Q3 = np.percentile(my_list, 75)
  1. Is the quality of the bases homogeneous across the read? Propose a hypothesis explaining that.
  2. Suggest an algorithm for cleaning reads and implement it in Python. This algorithm should allow trimming low-quality bases at the ends of the reads and then keep reads whose length is above a threshold. As pairing may be disturbed by cleaning, 4 output files are expected:

Tip: Here is a code snippet to create and fill a fastq file:

with open(path/to/my_new_file, 'w') as f:
  for record in records :
    SeqIO.write(record, f, 'fastq')
  1. To check the validity of the cleaning algorithm, plot the qualities of the clean paired files as done in part C.6.